A Curated Rotamer Library for Common Post-Translational Modifications of Proteins

Sidechain rotamer libraries of the common amino acids of a protein are useful for folded protein structure determination and for generating ensembles of intrinsically disordered proteins (IDPs). However much of protein function is modulated beyond the translated sequence through thFiguree introduction of post-translational modifications (PTMs). In this work we have provided a curated set of side chain rotamers for the most common PTMs derived from the RCSB PDB database, including phosphorylated, methylated, and acetylated sidechains. Our rotamer libraries improve upon existing methods such as SIDEpro and Rosetta in predicting the experimental structures for PTMs in folded proteins. In addition, we showcase our PTM libraries in full use by generating ensembles with the Monte Carlo Side Chain Entropy (MCSCE) for folded proteins, and combining MCSCE with the Local Disordered Region Sampling algorithms within IDPConformerGenerator for proteins with intrinsically disordered regions.


INTRODUCTION
Post-translational modifications (PTMs) refer to the chemical modifications that are made to the amino acids of a protein after translation to enable the cell to regulate its function.The importance of PTMs can't be understated given that at least 80% of mammalian proteins are modulated by PTMs, influencing essentially all biological processes, including cell signaling and metabolic pathways, transcriptional regulation, and DNA repair.In addition, dysregulation of PTMs is implicated in the development and progression of many diseases including cancer. 1 In spite of the fact that there are hundreds of PTMs known to occur in biology, 2 the available experimental structural data, e.g. from X-ray crystallography, cryoelectron microscopy (Cryo-EM) or Nuclear Magnetic Resonance (NMR) measurements, remains sparse across the full space of chemical modifications compared to unmodified proteins.
In principle, computational models can fill the gap for modeling the protein structural changes introduced by PTMs, although most structure-based prediction algorithms are primarily designed for the canonical amino acids.For example, while the original AlphaFold2 3 and RosettaFold 4 have revolutionized protein structure prediction for unmodified sequences, they did not handle PTMs.][10][11][12] PTMs modulate protein energy landscapes, which can lead to changes in conformations and in their dynamic interconversions.Protein flexibility also leads to the fluctuations of side chain packing arrangements that often have a direct functional role [13][14][15][16][17] .A large number of high-quality physical algorithms [18][19][20][21][22][23] and machine learning approaches [24][25][26] exist to perform side chain repacking for the canonical amino acids for folded proteins, disordered proteins when they undergo binding-upon-folding, or when they form dynamical complexes.
However, the ability to model side chain ensembles with PTMs are currently quite limited.There have been a small number of studies for computational modeling of PTMs 27 ; software packages such as Rosetta 28,29 , FoldX 30 , and SIDEpro 25 have the ability to model PTMs using rotamer libraries 31 .A recent method named GlycoSHIELD exclusively focuses on the modeling of glycosylation by grafting glycan conformer candidates sampled from molecular dynamics trajectories 33 .Given that experimental PTM structural data has continued to accumulate since many of these methods have been developed, it is worth creating new side chain rotamer libraries containing PTMs.Furthermore, since some rotamer libraries such as SIDEpro only characterized the effect of the PTM modifications on side chain torsion angles, it is also valuable to account for the rotamer distribution shifts of the backbone dihedral angles.
In this study, we have undertaken the creation of both backbone-dependent and backboneindependent rotamer libraries to comprehensively investigate the influence of PTMs on sidechain conformational ensembles.The decision to generate both types of libraries arises from the goal of capturing nuanced details of sidechain variability within the local structural context provided by the protein's backbone, as well as to understand more general patterns of sidechain conformations across diverse protein structures where data sparsity is less of a concern.We compare our rotamer libraries against SIDEpro 25 and Rosetta 28,29 , and show that the overall trend across various metrics show systematic improvements against folded proteins and IDP compared to these standard methods.

Structural Data for post-translational modifications of amino acids.
To generate structured data for PTMmodified amino acids to construct our datasets, we first accessed the PTM Structural Database (PTM-SD). 32ach data entry in the PTM-SD corresponds to a distinct post-translational modification on a specific amino acid residue.Within the PTM Structural Database, an inquiry targeting these amino acids and their associated modifications facilitated the retrieval of RCSB Protein Data Bank (PDB) identification codes for each modified residue.
Utilizing the obtained PDB identification codes, we conducted searches within the PDB database 33 to acquire structural data for the corresponding modified amino acids.A notable observation was that a subset of proteins frequently exhibited high sequence similarity and identity.Such a high degree of redundancy would introduce biases that could impact the interpretation of patterns and relationships in the rotamer data.To address this, the independent NCBI BLAST tool 34 was employed to detect identical sequences.Through the alignment of FASTA sequences from all proteins within the dataset, a threshold for sequence similarity of 90% or above was established, resulting in the clustering of protein structures.Subsequently, within each group, the structure having the highest resolution was selected for further analysis.Furthermore, PDB files with incomplete structural data were excluded.
Supplementary Tables S1 and S2 provide the curated PDB dataset for PTM-containing amino acids, further delineated by the refined resolution ranges for structures including each modified residue type, and the number of PDB files available in each resolution category.We found PDB files for phosphorylated serine (SEP), phosphorylated threonine (TPO), phosphorylated tyrosine (PTR), methylated arginine (AGM), mono-methylated lysine (MLZ), di-methylated lysine (MLY), tri-methylated lysine (M3L), oxidized methionine (OMT), and acetylated lysine (ALY) (see Supplementary Table S3).However, we chose to perform our analysis and rotamer generation to create PTM rotamer libraries only on modified amino acids having sufficient data by defining a resolution cutoff of 3.5 Å and lower, and requiring a minimum of 40 total PDB files.Based on these criteria, we only developed rotamer libraries for SEP, TPO, PTR, M3L, and ALY.
Statistical Analysis for PTM rotamer libraries.Given that the modified amino acids exhibit a nonuniform distribution of backbone phi (ϕ) and psi (ψ), and the relative sparsity of the observations of side chain rotamer chi (χi) for specific backbone dihedral angles, we use the adaptive kernel density estimation of Shapovalov and Dunbrack to estimate the rotamer probability density functions (PDFs) 35 .This method determines the width of the kernel based on the local density of data points, such that in denser regions narrower kernels are applied to capture more local variations while broader kernels create smoother distributions in regions where rotamer occupancy is sparse.
We employ the von Mises kernel 36 which is well-suited to periodic data such as dihedral angles.The Nadaraya-Watson kernel regression model 37,38 considers the influence of neighboring data points in a weighted manner, producing smoothed estimates of the mean and standard deviation values for χ angles within each ϕ/ψ bin.This smoothing helps mitigate the impact of noise and fluctuations in the data, providing a more robust and reliable characterization of the relationships between the backbone ϕ/ψ and sidechain χ dihedral angles.Bayes' rule is applied to these rotamer probability density functions to acquire the rotamer probabilities.
Backbone-dependent (BD) PTM rotamer libraries.In the construction of the BD-rotamer library, we discretized the backbone space into bins, each spanning a 30° interval.This level of granularity allows for a detailed representation of the local backbone geometry and facilitates the accurate prediction of sidechain conformations.The 30° bin size was chosen to balance computational efficiency with the need to capture subtle variations in sidechain orientations influenced by the local backbone structure.Regarding discretized bins which lack occupancy in the rotamer library, the probability and χ angle means and standard deviation are estimated using the backbone-independent (BI) probability distributions.
While canonical amino acids fit well into this discretization strategy, we find that PTMs sometimes do not fit into standard categories.To accommodate non-standard rotamer angle distributions observed for the χ2 of SEP and TPO, and χ3 of PTR, we split the probability distributions by their population density with clusters defined using DBSCAN 43 .Furthermore, as the terminal χ angles of phosphorylated SEP, TPO, PTR, and the methylated M3L have a 3-fold symmetry around the torus axis and show broad distributions regardless of the other sidechain angles, we estimated their means and standard deviations by only aligning with the conformer ranges of the angles themselves (Supplementary Figure 2).
Creating protein structures using PTM rotamer libraries for folded proteins and IDPs.We use our newly generated PTM libraries by generating side chain ensembles for folded proteins with the Monte Carlo Side Chain Entropy (MCSCE). 18The MCSCE algorithm is also embedded within the Local Disordered Region Sampling (LDRS) 39 algorithms within IDPConformerGenerator 40 for proteins with intrinsically disordered regions (IDRs).The backbone conformers of disordered regions were aligned and attached to folded domains using the LDRS module in IDPConformerGenerator, discarding conformations with steric clashes.For each disordered region case, we sampled 20,000 backbone conformations from a combination of loop, helices, and β-strands regions; successful and complete sidechain packing statistics are given in Supplementary Table S4.
Candidate sidechain packings were generated with Monte Carlo-Side Chain Entropy (MCSCE) 18 program for the disordered regions of the unmodified sequence using the Dunbrack library 39,44 , and with PTMs using the BD-rotamer and BI-rotamer libraries we developed in this work.We also used MCSCE to generate sidechain packings using the SIDEpro 25 PTM rotamer libraries.As the SIDEpro rotamer library only defines the additional sidechain angles for PTMs conditioned on the rotamer ranges of the existing angles in the unmodified amino acid residues, these sidechain angles were sampled according to the probability distribution of the corresponding canonical amino acids.In both cases, the rotamers of the folded domains are unchanged.The Rosetta 28,29 packed conformers were generated by invoking the fixbb application starting from the same backbone conformers attached to the folded domains, similarly with the folded domains held fixed.As Rosetta's scoring function during conformer packing does not define an explicit clash cutoff as in MCSCE, the Rosetta generated conformers were also filtered with the same clash criteria in MCSCE for comparison.

RESULTS
We first consider a BD-rotamer library for the SEP, TPO, PTR, M3L, and ALY that categorize sidechain conformations based on specific backbone ϕ and ψ dihedral angles.As shown by Dunbrack and coworkers 41 , a BD library for predicting side chain conformations produces much better results for protein structure refinement using NMR and X-ray data as opposed to a BI-rotamer library.The BI-rotamer library focuses solely on the distribution of sidechain dihedral angles, and for PTMs may be a necessity if the data is too sparse to differentiate it from the BD-rotamer case.
The influence of backbone conformations on PTM sidechain rotamer states is illustrated in Figures 1 and 2 (and Supplementary Figures S1-S3).In Figure 1, the χ1 of PTM-modified amino acids show distinct rotamer populations in different ϕ/ψ regions (and in relation to secondary structure), and more importantly the χ1 distributions for the PTM-modified amino acids also shift considerably from the unmodified canonical residues in these regions.For example, while the χ1 of SER, THR and LYS show similar populations regardless of ϕ/ψ ranges, the χ1 of SEP, TPO and M3L have visibly different preferences for certain backbone regions (Figure 1).This illustrates that the SIDEPro rotamer library that utilizes an unmodified χ1 will be unable to fully capture these structural changes exhibited by the PTMs.The χ2 of the phosphorylated amino acids SEP and TPO adopt rotamers that cannot be easily explained using bond hybridization models, but instead arise from favorable hydrogen bonding interactions 45 (Figure 2).In particular, TPO has a dominating χ1/χ2 population centered at -60°/120°, a configuration that encourages formation of an internal P-O/N-H hydrogen bond (Figure 2B).χ2 of SEP exhibits a mixed population that spans the angular range from 60° to -120°.In addition to the χ1/χ2 -60°/±120° configuration associated with an internal hydrogen bond (Figure 2-A1), we observe the formation of a hydrogen bond between P-O of the SEP residue i and N-H of its adjacent residue i+1 with a χ1/χ2 configuration around -180°/60° (Figure 2-A2).These varieties of sidechain rotamer states are consistent with the cooperative transition between a state in which the phosphate group is well-solvated and a state that forms intra-and inter-residue hydrogen bonds, as noted in reference (45).As shown in Figure 2 (and Supplementary Figures S1-S3), the observed higher percentage of hydrogen bond formations in the polyproline helical backbone region, and the resulting selectivity in the χ1/χ2 configurations, help rationalize the distinct rotamer populations for phosphorylated amino acids such as SEP and TPO that are dependent on backbone configurations.All these differences highlight the need to better characterize the rotamer states of PTM-modified amino acids apart from the canonical amino acids.We verify the fidelity of our constructed BD-rotamer and BI-rotamer libraries for PTMs by sampling their sidechain torsion distributions to generate new side chain packings and compare the resulting accuracies of the repacked structures to the experimental PDB data.In Supplementary Figure S4, we show that the sidechain rotamer distribution of the constructed libraries are in good agreement with the PDB data when sampled based on the same backbone conformation.When repacking PTMs sidechains, we removed the original PTM sidechain structures and then resampled the rotamers for PTMs with all other residues intact.We compare the RMSD values of the full protein structures with PTMs, regenerated with the BDrotamer and BI-rotamer libraries from this work against the SIDEpro and Rosetta libraries, and comapre the distributions as boxplots in Figure 3A.
It is evident that the BD rotamer library has a smaller interquartile range of somewhere between 0.25 Å to 1.0 Å and is skewed toward lower RMSD values across all of the PTM types compared to the other rotamer libraries (Figure 3A and Supplementary Table S5).The BI-rotamer library for PTMs also demonstrates trends in improvement to the other standard methods, with the exception of TPO, and the statistical significance of improvement is not as strong compared to the BD-rotamer library.We also performed a Wilcoxon Signed Rank test to evaluate the RMSD differences between each distribution pair with a 95% confidence level, in which the p-values from this test are in Supplementary Figure S5 show that the BD-rotamer library results in a statistically meaningful decrease in RMSDs compared to existing methods such as SIDEpro and Rosetta.For all PTM types considered in this work, the PTM BD-rotamer provides excellent performance in recovering the experimental structures; Figure 3B provides examples of repacked PTM-modified structures using all four methods and compared to experimental PDB structures.To demonstrate how our PTM libraries can support ensemble generation for disordered proteins, we considered two cases for which the proteins contain IDRs.The Histone H3 N-terminal IDR within the nucleosome structure (chains C and G, PDB ID 8SIY) have 4 of the 5 types of PTMs investigated here, with methylation at K9, phosphorylation at S10 and T11, and acetylation at K14 (Figure 4A).The ubiquitin recognition factor in ER-associated degradation protein 1 (UFD1) with phosphorylation at Y219 (PDB ID 2YUJ) is shown in Figure 5A.For both cases we compare ensembles generated with and without PTMs using our BD-rotamer and BI-rotamer libraries as well as libraries from SIDEpro and Rosetta, although SIDEpro only contains M3L, SEP and TPO since ALY is not supported by that method 25 .We also compared against results using the original Rosetta scoring function that ignores steric clashes for the Histone H3 nucleosome structure.Supplementary Figure S6 shows that the Ramachandran plots of the sidechain rotamer states with PTMs do not change for IDRs, nor among libraries, and only small changes are observed in secondary structure between rotamer libraries as seen in Supplementary Figure S7.This indicates that the structural changes are concentrated in any differences in side chain packing.
Figures 4C and 5C show that the torsional properties of the PTM IDR ensembles are different to IDR ensembles without sidechain modifications at the modification sites.Furthermore, the sidechain rotamer state changes are reflected differently for the BD-, BI-, SIDEpro, and two Rosetta-rotamer ensembles.To better analyze what are the structural consequences of the IDR ensembles generated with the different PTM rotamer libraries, we constructed 2D contact maps subtracting the values from the ensemble without PTMs (Figures 4D and 5D) to look for increases (red) or decreases (blue) of residue-residue contacts being made.For Histone H3, residues 5-14 show a higher population of contacts with the BDrotamer library, which we see corresponds to a much denser hydrogen-bonded network on average in this area (Figure 4B).A similar observation is found for UDF1 using the BD-rotamer library, and although it is only a single modification spot at residue 219, the PTM introduces a network of hydrogen bonds (example shown in Figure 5B).Especially for Histone H3, the other rotamer libraries show a smaller set of contacts or even net loss of contacts in this same region.In turn the BD-rotamer generated structures show a diminishment of contacts made by these same PTM residues with other regions of the protein, an effect which is muted in the other libraries.For the BI-rotamer library this is due to over-averaging, whereas for SIDEPro the differences with the unmodified ensembles is because it uses the same χ1 as the unmodified residues.It is interesting that the Rosetta rotamer-library combined with no clash criteria gives a nearly opposite trend in sidechain packing for residues with PTMs.Given that repacking structures on the folded protein backbone showed greater reliability with the BD-rotamer set, we extrapolate that there is better justification for supporting the structural consequences observed for the Histone H3 and UDF1 IDRs.

DISCUSSION AND CONCLUSION
Modeling sidechain conformations for PTMs have important applications in understanding protein conformational switches, including changes in dynamics and disordered protein interactions, leading to functional consequences modulated by PTMs.Although publicly available high-resolution structures including residues modified by many of the known PTMs are still limited, there is enough structural data for residues modified by phosphorylation, methylation, and acetylation.Thus, we have developed BDrotamer and BI-rotamer libraries for PTM-modified residues, using structural data curated and cleaned from the recent PTM-SD update.
We evaluate the constructed libraries in comparison to SIDEpro and Rosetta in the context of generating conformers for folded domains and for proteins with disordered regions.The BD-rotamer libraries outperform the BI-rotamer libraries as well as SIDEpro and Rosetta in retrieving the experimental structures for PTMs on the folded proteins, as evidence of its ability to capture the correlation more accurately between backbone and sidechain, and within sidechain dihedral conformations.For phosphorylated residues specifically, the ability to predict the sidechain rotamer cooperatively is crucial to modeling local hydrogen bonding interactions, thus allowing one to better delineate the structural features of single conformer and ensembles upon chemical modifications.We also show that the constructed PTMmodified residue rotamer libraries can be used along with MCSCE and IDPConformerGenerator to produce all-atom conformers for disordered proteins with PTMs.If experimental data such as nuclear magnetic resonance, small-angle X-ray scattering, and fluorescence resonance energy transfer are available, a reweighting protocol using X-EISD 46 or evolving underlying ensembles to agree with experimental data using DynamICE 42 can be applied to these all-atom sidechain-modified conformers to generate more realistic ensemble representations to investigate PTM-regulated activities and interactions.

Figure 1 .
Figure 1.Ramachandran plots color coded by χ1 angle ranges for PTM-modified amino acids using the backbonedependent library.Backbone bins in 60° separation are shown in gray dash lines.

Figure 2 .
Figure 2. Ramachandran plots color coded by χ1 and χ2 angle ranges for SEP and TPO with hydrogen bond formation using the backbone-dependent library.Backbone bins in 60° separation are shown in gray dash lines.Hydrogen bonds are defined by within a donor-acceptor distance cutoff of 3.5 Å. A1, A2 and B show representative configurations for SEP (A1, A2) and TPO (B) associated with the annotated backbone regions.

Figure 3 .
Figure 3. RMSD distributions of repacked PTM-modified residues using different rotamer libraries to experimental structures.SIDEpro does not support ALY packing.A).Boxplots for the RMSD distributions.Medians are highlighted in white and each box extends from the first quartile (Q1) to the third quartile (Q3).Outliers in circle are defined as points outside of 1.5 times the interquartile range below Q1 or above Q3.B).Repacked PTM-containing structures using different rotamer libraries compared to the experimental PDB structures.Examples are taken from PDB ID 3CLY (PTR), 4EZH (M3L) and 4QUT (ALY).

Figure 4 .
Figure 4. Histone H3 conformers generated with different PTMs libraries.A).Modifications on the histone H3 Nterminus on chains C/G of the nucleosome.B).Ensembles of 30 all-atom H3-modified nucleosome conformers (folded domains and DNA are taken from PDB ID 8SIY).The N-terminal IDRs on chain C and G are shown in salmon (loop) and cyan (helices), and folded domains (yellow).PTM-modified residues are highlighted with stick representations.C).Comparison of torsion angle probability distributions for PTM-modified residues in the H3 conformers with different libraries.D).Fractional inter-residue contacts (Ca-Ca distances within 8 Å) of PTMs containing ensembles (dptm) subtracted by the ensemble without modifications (d0) for the IDR regions.The maps were calculated with 500 randomly sampled conformers (based on convergence of Rgand averaged over 10 trials (blue-less, red-more).From left to right: BD-rotamer, BI-rotamer, SIDEpro, Rosetta and Rosetta without clash filtering, with contacts averaged from chain C and G.

Figure 5 .
Figure 5. UDF1 conformers generated with different PTM libraries.A).Modifications on the C-terminal IDR of UDF1 at Y219.B).Cartoon representations of 30 all-atom UDF1 conformers (structure of the folded domain taken from PDB ID 2YUJ model).The C-terminal IDR is shown in salmon red (loop) and cyan (helices), and the folded domains in yellow.C).Comparison of torsion angle probability distributions for PTM in the UDF1 conformers with different libraries.D).Fractional inter-residue contacts (Ca-Ca distances within 8 Å) of PTMs containing ensembles (dptm) subtracted by the ensemble without modifications (d0) for the IDR region around the PTM site.The maps were calculated with ensembles of 500 randomly sampled conformers and averaged over 10 trials.Top: BD-rotamer, BIrotamer; Bottom (L to R): SIDEpro, Rosetta and Rosetta without clash filtering.

Figure S3 .
Figure S3.Terminal torsion angle distributions for the curated PTM-modified residues.Rotameric bin definitions for tetrameric (sp 3 bond hybridization) conformations are marked in gray dash line.

Figure S4 .
Figure S4.Comparison of rotamer probability distributions for PTM-modified residues from curated PDB data and constructed libraries.All angles are sampled using the same set of ϕ/ψ, and probability densities are estimated with von Mises kernel.

Figure S5 .
Figure S5.Wilcoxon signed-rank tests for the RMSD distributions of repacked PTM-modified residues to experimental structures.The one-tailed p-values smaller than 0.05 indicate that the of median of the pairwise difference between the bottom and left distribution is smaller than 0 with a 95% confidence level (highlighted in blue).

Figure S6 .
Figure S6.Ramachandran plots of PTM-modified residues in the generated IDR-containing Histone H3 and UDF1 conformers with IDRs using different rotamer libraries.

Figure S7 .
Figure S7.Histone H3 and UDFI conformers generated with different PTMs libraries.Fractional secondary structure propensities for (A) histone H3 N-terminal IDRs on chain C and G and (C) UDFI.Secondary structures were calculated using DSSP.Helices are shown in solid line, loop in dotted line and β-sheet in loosely dashed line.Mean radius of gyration over ensemble size, calculated from an average of 30 randomly selected ensemble for (B) histone H3 N-terminal IDRs on chain C and G and (D) UDFI